AD-A213  966 


LABORATORY  FOR 
COMPUTER  SCIENCE 


MASSACHUSETTS 
INSTITUTE  OF 
TECHNOLOGY 


MIT/LCS/TM-396 


ADAPTIVE  BACKOFF 
SYNCHRONIZATION 
TECHNIQUES 


Anant  Agarwal 
Mathews  Cherian 



June  1989 


TECHNOLOGY SQUARE, CAMBRIDGE, MASSACHUSETTS 02139


Approved for public release; Distribution Unlimited


SECURITY CLASSIFICATION OF THIS PAGE


1a. REPORT SECURITY CLASSIFICATION

Unclassified 

2a.  SECURITY  CLASSIFICATION  AUTHORITY 

2b.  DECLASSIFICATION /DOWNGRADING  SCHEDULE 

4.  PERFORMING  ORGANIZATION  REPORT  NUMBER(S) 

MIT/LCS/TM-396 

6a.  NAME  OF  PERFORMING  ORGANIZATION 

6b.  OFFICE  SYMBOL 

MIT  Laboratory  for  Computer 

(If  applicable) 

Science 

6c  ADDRESS  (City,  State,  and  ZIP  Code) 

545  Technology  Square 

Cambridge,  MA  02139 

8a.  NAME  OF  FUNDING  /SPONSORING 

8b.  OFFICE  SYMBOL 

ORGANIZATION 

(If  applicable) 

DARPA/DOD 

8c.  ADDRESS  (City,  State,  and  ZIP  Code) 

1400  Wilson  Boulevard 

Arlington,  VA  22217 

REPORT  DOCUMENTATION  PAGE 


1b. RESTRICTIVE MARKINGS


3.  DISTRIBUTION /AVAILABILITY  OF  REPORT 

Approved  for  public  release;  distribution 
is  unlimited. 


5  MONITORING  ORGANIZATION  REPORT  NUMBER(S) 
N00014-87-K-0825 


7a  NAME  OF  MONITORING  ORGANIZATION 

Office of Naval Research/Department of Navy


7b. ADDRESS (City, State, and ZIP Code)
Information  Systems  Program 

Arlington,  VA  22217 


10.  SOURCE  OF  FUNDING  NUMBERS 


PROGRAM 
ELEMENT  NO. 


PROJECT 

NO 


WORK  UNIT 
ACCESSION  NO 


Adaptive  Backoff  Synchronization  Techniques 


12.  PERSONAL  AUTHOR(S) 


13a.  TYPE  OF  REPORT 
Technical 


13b.  TIME  COVERED 
FROM  TO 


14. DATE OF REPORT (Year, Month, Day)    15. PAGE COUNT

1989 June    21


COSATI  CODES 


FIELD  GROUP  |  SUB-GROUP 


18.  SUBJECT  TERMS  (Continue  on  reverse  if  necessary  and  identify  by  block  number) 
Barrier  Synchronization,  Multistage  Networks,  Hot-Spot 
Contention,  Shared  Memory  Multiprocessors 


19. ABSTRACT (Continue on reverse if necessary and identify by block number)

Shared-memory multiprocessors commonly use shared variables for synchronization. Our
simulations of real parallel applications show that large-scale cache-coherent multiprocessors
suffer significant amounts of invalidation traffic due to synchronization. Large
multiprocessors that do not cache synchronization variables are often more severely impacted.
If this synchronization traffic is not reduced or managed adequately, synchronization
references can cause severe congestion in the network. We propose a class of
adaptive backoff methods that do not use any extra hardware and can significantly reduce
the memory traffic to synchronization variables. These methods use synchronization state
to reduce polling of synchronization variables. Our simulations show that when the
number of processors participating in a barrier synchronization is small compared to the
time of arrival of the processors, reductions of 20 percent to over 95 percent in synchronization
traffic can be achieved at no extra cost. In other situations adaptive backoff
techniques result in a tradeoff between reduced network accesses and increased processor idle time.


20.  DISTRIBUTION /AVAILABILITY  OF  ABSTRACT 

☒ UNCLASSIFIED/UNLIMITED □ SAME AS RPT. □ DTIC USERS


22a. NAME OF RESPONSIBLE INDIVIDUAL

Judy  Little,  Publications  Coordinator 


DD FORM 1473, 84 MAR    83 APR edition may be used until exhausted.

All  other  editions  are  obsolete 


21.  ABSTRACT  SECURITY  CLASSIFICATION 

Unclassified


22b  TELEPHONE  (Include  Area  Code)  22c.  OFFICE  SYMBOL 
(617)  253-5894 


SECURITY  CLASSIFICATION  OF  THIS  PAGE 



Unclassified 


Adaptive Backoff Synchronization Techniques¹


Anant  Agarwal  and  Mathews  Cherian 
Laboratory  for  Computer  Science 
Massachusetts  Institute  of  Technology 
Cambridge,  MA  02139 


Abstract 

Shared-memory multiprocessors commonly use shared variables for synchronization. Our simulations of
real parallel applications show that large-scale cache-coherent multiprocessors suffer significant amounts of
invalidation traffic due to synchronization. Large multiprocessors that do not cache synchronization variables
are often more severely impacted. If this synchronization traffic is not reduced or managed adequately,
synchronization references can cause severe congestion in the network. We propose a class of adaptive backoff
methods that do not use any extra hardware and can significantly reduce the memory traffic to synchronization
variables. These methods use synchronization state to reduce polling of synchronization variables.
Our simulations show that when the number of processors participating in a barrier synchronization is small
compared to the time of arrival of the processors, reductions of 20 percent to over 95 percent in synchronization
traffic can be achieved at no extra cost. In other situations adaptive backoff techniques result in a
tradeoff between reduced network accesses and increased processor idle time.

1  Introduction


Processor self-scheduling schemes in shared-memory multiprocessors commonly use shared variables to synchronize
activities among processors [6, 22, 15]. This use of synchronization variables often leads to widespread
sharing among processors. Our trace-driven simulations of parallel applications show that these widely shared
synchronization variables adversely impact the performance of large-scale multiprocessors, cache-coherent or
otherwise.

In systems without hardware support for cache coherence, such as the IBM RP3 [18], Ultracomputer [9], and
Cedar [7], these references to shared variables must traverse the interconnection network. Not only do synchronization
references consume a significant fraction of the network bandwidth, but more importantly, a widely shared
synchronization variable (such as in a barrier synchronization) will result in heavy traffic to the same location
in memory and cause hot-spot contention problems [19].

On the other hand, in systems that use directory schemes to maintain cache coherence, we show that synchronization
variables result in excessive invalidation traffic when the number of pointers in the cache directory
is limited. A potential solution for the cache directories would be to implement software combining trees [25]
for synchronization variables. As long as the degree of the nodes in the combining tree is less than the number
of pointers in the cache directory, synchronization variables will not result in extra invalidation traffic.
We are currently investigating this approach and will not address it here. An alternate method is to disallow
caching of synchronization variables.

In this paper we consider software schemes to reduce the number of synchronization spins in multiprocessors
that do not cache their synchronization variables. We propose a set of adaptive backoff techniques which make
use of available synchronization state information in order to “back off” and postpone polling a synchronization
variable.

The general idea of backoff has been used in one form or another in a number of applications. The approach
was first used in Aloha [1], a radio-based, packet-switching network. If a collision occurred in the network,
each source would “back off” for a random interval before attempting to retransmit. The Ethernet [16] went one
step further and used a random retransmission interval in which collision history influenced the choice of the
mean of the random intervals. Adaptive control schemes for multiple access communications networks have
been analyzed in [13, 12, 14]. In addition to backoff history, we use information such as the expected time at which
the resource becomes available, or the network load, and adapt to the current circumstances.

We evaluate the performance of adaptive backoff synchronization techniques by applying them to the barrier
synchronization. Barrier synchronizations are commonly used in applications to guarantee that all processors
have reached a point in a program before proceeding.

This paper focuses on barriers implemented using two shared variables with busy waiting (or spinning) on
synchronization variables [22] (described in detail in a later section). While this form of implementation is
quite common, especially when exploiting fine-grain parallelism, alternate barrier implementations might use
a scheme where all but the last processor to arrive at the barrier are put to sleep (or blocked). Reactivation
of the processors is contingent on a condition variable signalled when the last process arrives at the barrier.
This method avoids the extra network traffic of polling a barrier flag, but incurs the potentially high overhead
of enqueuing a process on a condition variable. Often, the choice of busy waiting or blocking cannot be made
at compile time due to uncertainty in execution times of processes. In such cases, our adaptive methods can
be used to decide when it might be best to take a busy-waiting process out of circulation and queue it on a
condition variable as explained in a later section.

Hardware support for barriers has also been proposed in several forms. The RP3 [18] proposed using
a combining network in which the switches contain special hardware to combine simultaneous data accesses
destined to the same location in memory and forward one request. This would eliminate contention in the
network and at the memory modules, but RP3 cost estimates for this approach predict that switch size and/or
cost for a 2 × 2 switch could increase by a factor between 6 and 32. Several cache-coherent multiprocessors allow
simultaneous invalidates of all cached copies of a block. In such systems all repeat accesses of a synchronization
variable can be satisfied by the cache. However, the need to rely on resources that can support broadcast
invalidates, such as a shared bus, limits the scalability of such systems. The PAX computer [10] uses special
global-synchronization logic implemented in hardware to allow low-latency, low-cost barrier synchronization.
Issues which arise with this approach concern flexibility in allowing multiple numbers of barriers to execute
simultaneously with varying numbers of processors.

Our results show that backoff techniques applied to barriers yield reductions in synchronization traffic of 20
percent to over 95 percent in cases where the number of processors involved in the barrier is small compared to
the time of arrival between processors. In other situations, these schemes provide a tradeoff between cost (in
terms of processor idle time) and performance. The user can determine this tradeoff depending on particular
needs or the application being run. We also discuss other applications of adaptive backoff schemes in Section 8.

The rest of this paper is organized as follows. We first present results from our trace-driven simulations
describing how synchronization impacts large-scale multiprocessors. We then describe the network model that
we assume for this study. Section 4 presents the adaptive backoff synchronization techniques as they apply
to barriers. We then discuss the barrier evaluation model and our simulation methodology. We evaluate these
ideas and discuss the tradeoffs involved in their implementation using a simple analytical model and through
simulations in Sections 6 and 7. Sections 8 and 9 suggest extensions to our work and summarize our findings.

2  The  Synchronization  Problem 

In this section we present data from trace-driven simulations of the FFT [4], SIMPLE [5], and WEATHER [11]
applications and explain why synchronization is a problem in large-scale systems. We will illustrate the problem
through the barrier synchronization example.

A typical implementation of a barrier might use a shared variable whose initial value is zero. Each processor
arriving at the barrier increments the shared variable. If the variable attains the value N, implying that all N
processors have reached the barrier, the processor can proceed. Otherwise, it repeatedly tests the barrier until the
above condition is true. The increment operation on the barrier variable must be atomic. This implementation
has the drawback that each processor attempting to increment the barrier variable must contend with all the
others simply polling it to test for the proceed condition.

A better implementation, e.g., Tang and Yew’s [22], splits the barrier into two shared variables: an incrementing
variable (henceforth called the barrier variable) initially set to zero, and a barrier flag variable also
initially reset. An arriving processor increments the barrier variable. If the variable’s value is less than N, the
processor polls the barrier flag which is set by the last processor to reach the barrier. Even this scheme requires
that the last processor to reach the barrier compete with the N-1 processors testing the barrier flag when it
tries to set the flag.
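
The two-variable scheme described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation of [22]: the class name `TwoVariableBarrier`, the helper `run`, and the `polls` counter are hypothetical, a Python lock stands in for the atomic increment, and the barrier is single-use (no reset logic).

```python
import threading
import time


class TwoVariableBarrier:
    """Single-use spin barrier: a counter (barrier variable) plus a separate flag."""

    def __init__(self, n):
        self.n = n
        self.count = 0                 # barrier variable, initially zero
        self.flag = False              # barrier flag, initially reset
        self._lock = threading.Lock()  # assumption: stands in for an atomic increment
        self.polls = 0                 # total flag reads, kept only for illustration

    def wait(self):
        with self._lock:               # atomic increment of the barrier variable
            self.count += 1
            arrived = self.count
        if arrived == self.n:
            self.flag = True           # the last arrival sets the flag
            return
        while True:                    # every other processor polls the flag
            with self._lock:
                self.polls += 1
            if self.flag:
                return
            time.sleep(0)              # yield so that other threads can run


def run(n):
    """Send n threads through one barrier and return the barrier for inspection."""
    barrier = TwoVariableBarrier(n)
    threads = [threading.Thread(target=barrier.wait) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return barrier
```

Every one of the N-1 waiting threads reads the flag at least once, so `polls` grows with both the number of participants and the time they wait; this is the traffic that the backoff methods later in the paper try to reduce.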

The important point to note, however, is that in both implementations, the shared variables involved are
necessarily shared among all processors in the system. It is precisely this widespread sharing which impacts
performance when scaling to large systems.

2.1  Synchronization  References  and  Scalability 

The widespread sharing that occurs with synchronization variables is not a problem when used in bus-based
snoopy-cache multiprocessors [8, 23]. Because snoopy-cache-based protocols perform broadcast invalidates or
updates, a variable shared among all processors generates no more traffic on the shared bus than a variable
shared among only two processors. The limitation of snoopy-based schemes, however, is that they do not scale
to large multiprocessor systems. Since these schemes require low latency broadcasts for cache coherence, as well
as the ability to “watch” all bus transactions, they must use a shared bus for communication. A single bus
cannot offer the bandwidth demanded by large-scale shared-memory multiprocessors.

Unfortunately, widespread sharing of synchronization variables can drastically impair performance in large-scale
multiprocessors, cache-coherent or otherwise. First, let us consider multiprocessors with coherent caches,
where a directory is used to keep track of cached copies of shared blocks. In general, for every memory block,
a directory must store as many pointers as the number of processors (say N) in the system [3]. Such a scheme
is termed Dir_N NB, for N-pointers-No-Broadcast in [2]. In practice, it is possible to maintain just i pointers
(i < N) to yield the Dir_i NB scheme [2]. Invalidations are forced to limit the cached copies of a block to i, or to
gain exclusive ownership on a write. Results in [2] showed that during an invalidation situation, few invalidations
were actually necessary. Results from our trace-driven simulations of 64-processor systems discussed below as
well as the results in [24] corroborate the findings in [2].

Figure 1: Cache invalidation statistics for SIMPLE with 64 processors. The height of a bar at x reflects the fraction
of write hits to previously clean blocks that resulted in x invalidation messages.

Figure 1 shows an invalidation histogram for a 64-processor simulation of Dir_N NB driven by a trace from the
SIMPLE application. We also ran simulations on FFT and WEATHER application traces with 64 processors.²
The simulations used direct-mapped caches of size 256KBytes and block size 16 bytes. The graph shows the
histogram of the number of invalidations required during a write to a previously clean block. We see that in
over 95 percent of the times that an invalidation occurred (in both 16- and 64-processor simulations), a block
had to be invalidated from no more than three caches. Invalidation histograms for FFT and WEATHER had
a corresponding figure of over 99 percent. The graph shows the percentage of writes which resulted in invalidations
to up to 12 caches. Writes resulting in invalidations of greater numbers of caches were proportionately
insignificant.

Why do synchronization references hurt performance? Our simulations revealed that synchronization variables
were largely responsible for the cases in which more than three caches were invalidated. Synchronization
references are even more damaging when the effect of simultaneous read sharing is considered. Recall that using
i pointers limits simultaneous read sharing of a block to only i copies, and invalidations must occur to enforce
this rule. For synchronizations like barriers, active sharing might occur among all processors involved, resulting
in a high invalidation rate in directory-based schemes.
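
The effect of a limited pointer count on a widely read variable can be illustrated with a toy model of one directory entry. This is a rough sketch under stated assumptions, not the simulator used for the results here: the names `LimitedDirectory` and `poll_rounds` are hypothetical, and the choice of evicting the oldest sharer when the pointers run out is an assumption, since the text does not say which copy is invalidated.

```python
class LimitedDirectory:
    """Directory entry for one block with at most `pointers` sharers, no broadcast."""

    def __init__(self, pointers):
        self.pointers = pointers
        self.sharers = []          # caches currently holding a copy, oldest first
        self.invalidations = 0

    def read(self, cache_id):
        if cache_id in self.sharers:
            return                 # already cached: no directory action needed
        if len(self.sharers) == self.pointers:
            self.sharers.pop(0)    # assumption: invalidate the oldest copy
            self.invalidations += 1
        self.sharers.append(cache_id)

    def write(self, cache_id):
        others = [c for c in self.sharers if c != cache_id]
        self.invalidations += len(others)
        self.sharers = [cache_id]  # the writer gains exclusive ownership


def poll_rounds(num_procs, pointers, rounds):
    """Invalidations caused by num_procs caches reading one flag in round-robin order."""
    entry = LimitedDirectory(pointers)
    for _ in range(rounds):
        for proc in range(num_procs):
            entry.read(proc)
    return entry.invalidations
```

With 8 processors polling for two rounds, `poll_rounds(8, 3, 2)` counts 13 invalidations (every read after the first three displaces a copy), while `poll_rounds(8, 8, 2)` counts none, which is the read-sharing penalty the paragraph above describes.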

Table 1 shows the fraction of synchronization references out of the total number of synchronization references
which resulted in an invalidation. The percentage is far higher than the corresponding fraction for
non-synchronization data references. The values in the table are slightly pessimistic, because the processors
were simulated to make memory requests in round robin fashion (see Section A in the Appendix for more
details). In all cases the percentages of references resulting in invalidations for both non-synchronization and
synchronization references improve as the number of pointers in the scheme increases from two to three.

It is clear that invalidation traffic due to synchronizations can be deleterious to the performance of cache-coherent
multiprocessors. One solution is to use software combining trees. Alternatively, one can disallow
caching synchronization variables.

Table 1: Percentage of synchronization and non-synchronization references that cause invalidations in directory schemes
with 2, 3, 4, 5, and 64 pointers. Synchronization references comprised 0.2%, 7.9%, and 5.3% of the data references in
FFT, WEATHER, and SIMPLE respectively.

² See Section A in the Appendix for a description of the applications, the tracing technique, and multiprocessor simulation
methodology.

2.2  Disallowing  Caching  of  Synchronization  Variables 

If most synchronization accesses cause invalidations that involve multiword transfers, then why cache synchronization
variables? The problems with disallowing caching are similar to those in multiprocessors that make all shared
locations uncacheable: increased network traffic and potential hot-spot contention. Synchronization references,
such as those due to a barrier, are often to the same location in memory, and even a small percentage of all
data accesses directed to the same “hot” module can cause tree saturation [19] in the interconnection network and a
corresponding severe drop in the effective memory bandwidth.

Table  2  shows  that  the  percentage  of  uncached  synchronization  traffic  to  memory  out  of  the  total  data  traffic 
can  be  large.  We  compute  traffic  to  memory  by  summing  the  total  number  of  network  transactions  generated 
by  references.  For  example,  in  the  case  of  a  cache  miss,  two  network  transactions  are  generated:  one  to  send 
the  requested  address  to  memory  and  one  to  send  the  requested  data  from  memory  to  the  processor. 

The  reason  SIMPLE  and  WEATHER  generate  far  more  synchronization  traffic  than  FFT  is  that  their  load 
balancing  is  not  as  good  as  in  FFT  (see  Section  A  for  details),  resulting  in  more  synchronization  accesses  at  loop 
barriers  as  processors  wait  for  all  processors  to  arrive.  The  slight  relative  increase  of  synchronization  overhead 
in  all  cases  when  going  from  two  to  five  pointers  is  because  synchronization  traffic  remained  constant  while 
invalidation  traffic  (part  of  total  memory  traffic)  decreased  as  more  pointers  were  available  for  sharing  of  blocks. 

Therefore,  if  we  are  to  scale  multiprocessors,  network  traffic  due  to  synchronization  must  be  rigorously 
minimized. 

In large-scale shared-memory multiprocessors, such as the RP3, Ultracomputer, and Cedar, all traffic to shared
variables must go over the network³, and the relative fraction of network accesses attributable to synchronization
is slightly smaller. We measured memory traffic when shared variables were not cached and found that
synchronization traffic accounted for 25.5%, 49.2%, and 1.47% of the total traffic in SIMPLE, WEATHER, and
FFT, respectively. Our motivation for reducing the network traffic, especially traffic that is partial to a specific
memory location, still remains.

The adaptive backoff techniques we are proposing are software solutions to help alleviate the hot-spot
contention problem by reducing the number of idle synchronization spins. These techniques could even be used
in conjunction with hardware solutions such as a combining network. The combining network is much slower
than a conventional network, so we still would like to reduce the amount of synchronization traffic traversing
the network.

³ Although temporary caching of shared locations with compiler-inserted cache flush directives can help relieve network load.

Application   Pointers   Traffic (%)
SIMPLE        2          22.0
              3          [illegible]
              4          [illegible]
              5          [illegible]
              64         [illegible]
WEATHER       2          65.4
              3          [illegible]
              4          [illegible]
              5          57.6
              64         59.9
FFT           2          1.3
              3          1.4
              4          1.5
              5          1.5
              64         1.8

Table 2: Synchronization traffic to main memory as a percentage of the total traffic when the synchronization variables
are not cached. Block size is assumed to be 16 bytes and cache size is 256KBytes. The non-synchronization blocks are
cached and coherence is maintained using directory schemes with 2, 3, 4, 5, and 64 pointers.


3  The  Network  Model 

The  network  model  that  we  assume  is  the  following:  processors  can  access  any  memory  over  the  network  in  one 
network  cycle.  We  do  not  model  network  contention,  but  do  model  contention  for  the  barrier  variable  and  flag. 
We  also  assume  that  the  barrier  variable  and  flag  are  in  different  memory  modules,  so  simultaneous  requests 
to  the  two  by  different  processors  can  be  satisfied.  We  assume  that  in  a  network  cycle  only  one  processor  can 
access  the  barrier  variable  or  the  barrier  flag.  If  a  processor  is  denied  access  to  the  variable  in  a  network  cycle  it 
repeats  the  access  to  the  variable  in  the  next  network  cycle.  This  model  might  correspond  to  a  crossbar  switch 
where  the  only  contention  is  for  the  end  memory  modules  that  have  the  barrier  variable  and  flag;  contention 
due  to  other  non-synchronization  references  is  not  included.  It  also  roughly  approximates  the  performance  of  a 
circuit-switched  multistage  interconnection  network,  where  the  network  cycle  time  can  be  the  round-trip  time 
over  the  network.  In  the  latter  case  the  contention  at  intermediate  network  nodes  is  not  included. 
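
The contention rule of this model (one access per memory module per network cycle, with denied processors retrying in the next cycle) can be sketched as follows. This is a minimal sketch under assumptions: the function name `serve_module` is hypothetical, and serving the earliest requester first, with ties broken by processor index, is an arbitration choice made here only for illustration.

```python
def serve_module(request_cycles):
    """Serve accesses to a single memory module, one per network cycle.

    request_cycles[k] is the cycle in which processor k first issues its access.
    A denied processor repeats the access every cycle until it is served, so the
    number of network accesses it makes is its waiting time plus one.
    Returns (served, accesses), two dicts keyed by processor index.
    """
    order = sorted(range(len(request_cycles)), key=lambda k: request_cycles[k])
    served, accesses = {}, {}
    cycle = 0
    for k in order:                 # assumption: earliest requester is served first
        cycle = max(cycle, request_cycles[k])
        served[k] = cycle
        accesses[k] = cycle - request_cycles[k] + 1  # failed tries plus the success
        cycle += 1                  # the module is busy for this cycle
    return served, accesses
```

For three processors that all request in cycle 0, the model serves them in cycles 0, 1, and 2 and charges them 1, 2, and 3 network accesses respectively, which shows how simultaneous arrivals become serialized at the module.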

The  network  traffic  rates  computed  using  our  barrier  scheme  might  also  be  input  into  a  more  complex  model 
of  a  multistage  interconnection  network  such  as  that  proposed  by  Patel  [17]  if  network  contention  results  are 
desired.  Unfortunately  Patel’s  model  does  not  account  for  hot-spot  contention.  We  are  also  using  large  parallel 
traces  of  real  applications  derived  using  various  synchronization  schemes  to  drive  network  models  to  obtain 
performance  estimates  in  the  presence  of  hot-spots  caused  by  barrier  traffic  and  when  the  barrier  traffic  is 
reduced  using  our  techniques. 


4  Adaptive  Backoff  Barrier  Synchronization 

The  basic  idea  behind  adaptive  backoff  methods  is  simple.  An  adaptive  backoff  barrier  technique  makes  use  of 
available  information  in  deciding  how  long  to  wait  before  trying  to  read  a  barrier  flag  rather  than  continuously 
polling  the  flag.  If  necessary,  the  adaptive  method  can  also  provide  a  hint  to  the  processor  to  queue  itself  on 
the  barrier  flag. 

We will assume barriers implemented using a separate barrier variable and a barrier flag as described earlier.
If the barrier variable and flag are one and the same object, the relative advantage of using adaptive backoff
techniques will be even greater.

4.1  Backoff  on  the  barrier  variable 

The first method, called backoff on the barrier variable, is the simplest and tries to reduce unnecessary network
accesses on the barrier flag. In this method, the barrier implementation is optimized by making use of the
state of the barrier variable. The barrier variable value reveals the number of processors waiting at the barrier.
Let there be N processors that must arrive at the barrier, and let the average memory access time over the
network be 1 cycle as mentioned earlier. If i processors have reached the barrier, then an arriving processor can
start polling the barrier flag at least (N-i) cycles after reaching the barrier variable. Waiting to re-poll the
barrier variable can be implemented as a processor loop that does not access memory, with the loop count set
as a function of the waiting time.
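
The waiting rule above amounts to a one-line computation. The sketch below is illustrative only; the function names `wait_after_increment` and `idle` are hypothetical, and the Python loop merely stands in for a processor delay loop that touches no shared memory. The bound presumably follows from the network model: each of the remaining N-i processors needs at least one cycle to increment the barrier variable, so the flag cannot be set any sooner.

```python
def wait_after_increment(n, arrived):
    """Cycles to idle before the first flag poll.

    `arrived` is the value of the barrier variable seen by this processor after
    its own increment, i.e. the number of processors (out of `n`) at the barrier.
    """
    if not 1 <= arrived <= n:
        raise ValueError("arrived must be between 1 and n")
    return n - arrived


def idle(cycles):
    """Delay loop that makes no memory-system requests (illustrative stand-in)."""
    for _ in range(cycles):
        pass
```

For a 64-processor barrier, the first arrival would idle for `wait_after_increment(64, 1)`, that is 63 cycles, before its first flag poll, while the last arrival does not wait at all.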

4.2  Backoff  on  the  barrier  flag 

We  will  also  look  at  other  methods  that  try  to  further  reduce  the  number  of  spins  on  the  barrier  flag.  Processors 
can  keep  track  of  the  number  of  times  they  have  polled  the  barrier  flag  and  correspondingly  backoff  by  a  linear 
or  exponential  amount  the  longer  they  have  waited.  This  code  can  be  part  of  the  barrier  implementation  in 
software  and  needs  no  hardware  support.  We  call  this  group  of  techniques  backoff  on  the  barrier  flag.  In  all  our 
discussions  of  the  performance  of  these  latter  methods,  we  assume  that  backoff  on  the  barrier  variable  is  also 
applied. 

In  backoff  on  the  barrier  variable,  if  the  interarrival  times  of  processors  are  very  large,  then  a  processor  might 
wait  its  N-i  cycles  and  start  polling  the  barrier  flag  long  before  the  last  processor  arrives  at  the  barrier.  In 
these  situations,  we  might  wait  longer  before  polling  the  flag,  say  (N-i)+C  or  (N-i)*C,  where  C  is  some  positive 
integer.  While  this  might  reduce  the  number  of  unnecessary  network  accesses,  it  might  also  cause  the  processor 
to  remain  idle  and  miss  accessing  the  barrier  at  the  earliest  it  becomes  available.  We  suggest  some  methods  of 
choosing  appropriate  backoff  parameters  in  Section  8. 
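
To make the tradeoff concrete, the following sketch counts flag polls for a single uncontended processor under no backoff, linear backoff, and exponential backoff. It is a simplified illustration, not the simulation used for the results reported here: the function names, the parameters `step` and `base`, and the exact schedules (a delay of `step * k` or `base ** k` cycles after the k-th unsuccessful poll) are assumptions.

```python
def flag_backoff(failed_polls, scheme, step=1, base=2):
    """Idle cycles to insert after `failed_polls` unsuccessful reads of the flag."""
    if scheme == "linear":
        return step * failed_polls
    if scheme == "exponential":
        return base ** failed_polls
    return 0  # scheme None: poll again on the very next cycle


def polls_until_set(flag_set_cycle, first_poll_cycle, scheme=None):
    """Return (polls made, idle cycles past the moment the flag was set)."""
    cycle, polls = first_poll_cycle, 0
    while True:
        polls += 1                      # each poll is one network access
        if cycle >= flag_set_cycle:
            return polls, cycle - flag_set_cycle
        cycle += 1 + flag_backoff(polls, scheme)  # 1 cycle for the poll itself
```

If the flag is set at cycle 100 and polling starts at cycle 0, continuous polling makes 101 accesses with no overshoot, linear backoff makes 14 accesses and overshoots by 4 cycles, and exponential backoff makes 7 accesses but overshoots by 32 cycles. Fewer network accesses are bought with more idle time, as the text describes.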

In backoff on the barrier flag, there exists a danger of backing off much more than necessary. Clearly there is
a tradeoff between network access reduction and cpu idle time. If only a few processors are involved in a barrier
synchronization, then to reduce the hot-spot contention problem, one might prefer to take the hit in cpu idle
time for these contending processors so that the remaining processors in the system can perform unhindered.
As mentioned before, even a small percentage of memory references to the same “hot” memory module can
result in severe congestion of the interconnection network, thereby reducing all processors’ utilization [19]. Of
course, if all processors in a system are involved in a barrier synchronization, then the cpu idle time becomes
an important consideration.

Note that the backoff algorithm we use is deterministic, unlike the adaptive control algorithms used in [13,
12, 14] where the probability of a retry is adaptively adjusted. We choose this route for the following reasons: 1)
we want backoffs to be as efficient as possible. Our deterministic backoff can be computed in a few instructions
as opposed to the hundreds of instructions which would be necessary to compute retry probabilities adaptively
and determine whether or not to perform a retry every cycle; and 2) often when processors first contend
for a synchronization variable such as a barrier flag, their execution becomes serialized. Once serialized, the
processors experience no contention the next time they poll the barrier flag. Since all the processors back off by
equal amounts, the serialization is preserved. However, if the processors retry probabilistically, the serialization
is destroyed and could result in contention again.

Backoff  decisions  are  made  only  when  a  process  has  just  updated  the  barrier  variable,  and  when  the  process 
has  read  the  barrier  flag  and  the  flag  is  not  set.  So,  once  a  processor  initiates  a  barrier  read  request,  the 
network  controller  for  that  processor  attempts  to  read  the  barrier.  If  contention  thwarts  this  attempt,  the 
access  is  repeated  until  the  flag  is  read.  We  do  propose  some  other  schemes  where  the  network  controller  can 
back  off  if  the  congestion  in  the  network  is  high. 

For software-tree based implementations of barriers on non-cache-coherent multiprocessors as suggested by
Yew, Tzeng, and Lawrie [25], our methods can still be used to reduce the spins on the intermediate nodes of
the tree.

Figure 2: Intervals of execution and synchronization.

We evaluate these ideas using a barrier model through analysis and simulations and discuss the tradeoffs
between reduced synchronization accesses and wasted cpu cycles.


5  A  Barrier  Model 

We  will  first  describe  the  model  that  we  use  to  evaluate  barriers.  We  use  two  metrics:  (1)  the  number  of 
network  accesses  per  process  in  accessing  the  barrier  variable  and  barrier  flag;  and  (2)  the  number  of  cycles  that 
an  average  process  spends  from  the  time  it  arrives  at  the  barrier  to  the  time  it  is  allowed  to  proceed  from  the 
barrier. 

Overall  performance  is  impacted  by  the  total  network  traffic,  which  includes  the  regular  non-barrier  traffic 
and  the  barrier  traffic.  Because  we  currently  do  not  model  hot-spot  traffic  contention  in  the  network,  we 
preferred  to  present  the  numbers  for  the  barrier  traffic  alone,  as  average  numbers  for  overall  traffic  might  be 
misleading  in  terms  of  the  adverse  effect  of  the  barrier  traffic  focused  on  one  memory  module.  We  also  provide 
measurements of the time between barrier accesses in parallel applications. If necessary, our barrier traffic
numbers can be amortized over this entire period to get the contribution of barrier traffic to overall average
traffic. 

Let  us  define  A  to  be  the  time  interval  during  which  processes  can  arrive  at  the  barrier.  A  is  the  time  from 
the  first  processor’s  arrival  at  the  barrier  variable  to  the  last  processor’s  arrival  at  the  barrier  variable.  The 
complementary  interval  between  these  two  events  we  call  E,  i.e.,  the  time  between  barriers  in  an  application.  If 
we  were  to  follow  an  application’s  execution  through  time,  E  and  A  would  appear  as  shown  in  Figure  2. 

We measured A for our three applications. In Table 3, A is defined to be the number of cpu cycles from the
time the first processor starts polling the barrier flag to the time the last processor sets the barrier flag. It is
interesting to note that the average A for SIMPLE and WEATHER did not increase as greatly as for FFT when
going from 16 to 64 processors. For highly uniform and load-balanced applications such as FFT, the spread
among arrivals is primarily due to the serialization which takes place at the loop index assignment. Thus, FFT
was  relatively  more  affected  than  the  other  applications  when  the  number  of  processors  increased. 

The reason E and A for SIMPLE and WEATHER with 64 processors are similarly sized intervals is that
the applications were not perfectly load-balanced. Not all the parallel loops contained a nice multiple of
iterations which could be distributed evenly among all processors. The few processors that did not get work
went straight to the barrier at the end of the loop.

The  barrier  model  that  we  use  for  our  analysis  and  simulations  is  actually  slightly  different  and  allows  us 
to  model  a  varying  number  of  synchronizing  processors  for  a  given  value  of  A.  Our  measurements  of  A  from 
the  applications  were  for  a  relatively  large  number  of  processors  and  this  measurement  yields  an  indication  of 
the maximum time span between the first and last arrival at a synchronization point in that application. It is
likely that a smaller number of processors can have an actual value of A much smaller than this maximum span.
Therefore, we now define A to be the interval during which processors may arrive at the barrier, and N to be
the number of synchronizing processors. We further assume that each processor has a uniform probability of
appearing  at  any  time  instant  during  the  interval  A.  From  the  uniform  probability  of  arrival  during  the  interval 
A  we  must  compute  the  average  time  span  between  the  first  and  last  arrivals  out  of  a  total  of  N  arrivals.  This 
span  must  tend  to  A  as  N  becomes  large. 

Application   Processors   A             E
SIMPLE        16           7021          [illegible]
              64           [illegible]   [illegible]
WEATHER       16           82754         495298
              64           82787         82716
FFT           16           237           228073
              64           [illegible]   [illegible]

Table  3:  Average  number  of  cycles,  A,  between  first  and  last  arrivals  at  waits  and  barriers.  E  is  the  average  number 
of  cycles  between  the  last  arrival  at  the  previous  barrier  (or  wait)  and  the  first  arrival  at  the  next  barrier  (or  wait),  i.e. 
it  is  the  average  time  between  barriers  or  waits. 



Figure  3:  Arrival  distribution  of  the  processors  involved  in  a  synchronization  during  the  interval  A. 


To determine whether our assumption of uniform probability of arrival within A was reasonable, we measured
the arrival times in our applications and plotted the times in a histogram in Figure 3. It is easy to see that the
distribution is roughly uniform for FFT but is skewed towards the beginning and the end of the interval for
SIMPLE. This skewing occurs because of uneven load balancing. We observed, however, in the last peak that
processor arrivals were still uniform over the last 200 references. There seems to be no real pattern, and our
assumption of a uniform distribution is not expected to significantly change our results for minor variations in
the arrivals. We also present additional validation of this model by comparing the predictions obtained through
simulations using the model and through measurements using the actual traces in Section 7.1.

5.1  Analytically  Estimating  Barrier  Performance 

We  first  present  some  simple  calculations  for  extreme  cases  of  A  to  determine  the  bounds  on  the  possible  savings 
and  to  provide  insight  into  our  simulations. 

For  the  case  A  =  0  (all  processors  arrive  simultaneously)  and  no  backoff,  a  processor  will  make  on  average 
N  +  N  +  N/2  synchronization  references.  Each  processor  makes  on  average  N/2  references  to  get  at  the 
barrier  variable,  polls  the  barrier  flag  N/2  references  before  the  last  processor  gets  through  the  barrier  variable, 
continues  polling  the  barrier  flag  N  times  until  the  last  processor  can  set  the  flag,  and  finally  leaves  after  N/2 
references, on average. We denote this model, which assumes arrival at the same instant, as Model 1.

If A >> N, there is practically no contention to get the barrier variable. In this case we assume that
processors appear at the barrier at a given time instant within the time interval A with uniform probability.
Let us first compute the average time span τ between the first and the last arrival within the interval A given
N processors. The average time from the beginning of the interval to the first arrival can be shown to be
A/(N + 1), and the average time from the beginning of the interval to the last arrival to be AN/(N + 1). The required
time span τ is the difference of the two, or

$$\tau = A \, \frac{N - 1}{N + 1} \qquad (1)$$

Observe that τ approaches A as N becomes large. Thus, each processor makes on average τ/2 + N + N/2 network
accesses during the synchronization phase. We call this Model 2.
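
As a rough check on Equation (1), the following sketch compares the closed form with a Monte Carlo estimate of the span between the first and last of N uniformly distributed arrivals. The function names, the number of runs, and the seed are illustrative assumptions, not part of the original study.

```python
import random


def expected_span(A, N):
    """Equation (1): mean time between the first and last of N uniform arrivals in [0, A]."""
    return A * (N - 1) / (N + 1)


def simulated_span(A, N, runs=20000, seed=1):
    """Monte Carlo estimate of the same span (runs and seed are arbitrary choices)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        arrivals = [rng.uniform(0, A) for _ in range(N)]
        total += max(arrivals) - min(arrivals)
    return total / runs


def model2_accesses(A, N):
    """Model 2 estimate of accesses per processor: tau/2 + N + N/2."""
    return expected_span(A, N) / 2 + N + N / 2
```

For A = 1000 and N = 16, `expected_span` gives about 882 cycles, so the Model 2 estimate is roughly 441 + 24 = 465 accesses per processor.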

Let us now consider backoff on the barrier variable. In this technique, we back off by an amount proportional
to the value of the barrier variable. If i is the value of the barrier variable upon a processor's arrival, then the
processor can wait N - i cycles before beginning to poll the barrier flag. When A = 0, the average number of
synchronization accesses becomes N/2 + N + N/2 because the processor does not start polling the flag
until the last processor gets through the barrier. A similar savings of N/2 is made for A >> N. With backoff
only on the barrier variable, the potential savings get proportionally smaller as A gets larger because the savings is a constant
N/2 no matter what A is.
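
The arithmetic for the A = 0 case can be written out as a small sketch. The function names are hypothetical; the formulas are the Model 1 terms given in the text, with and without the N/2 early flag polls that the backoff removes.

```python
def variable_backoff_delay(N, i):
    """Cycles to wait before the first flag poll, given barrier value i on arrival."""
    return max(N - i, 0)


def accesses_no_backoff(N):
    """Model 1: N/2 to get the barrier variable, N/2 + N flag polls, N/2 to leave."""
    return N / 2 + (N / 2 + N) + N / 2


def accesses_variable_backoff(N):
    """Model 1 with backoff on the barrier variable: the N/2 early flag polls are skipped."""
    return N / 2 + N + N / 2


def relative_saving(N):
    """Fraction of accesses saved by backing off on the barrier variable."""
    base = accesses_no_backoff(N)
    return (base - accesses_variable_backoff(N)) / base
```

For N = 64 this gives 160 accesses without backoff and 128 with it, a 20% saving under the model.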

Of course, a modified scheme that backs off some constant factor times the value in the barrier, to account
for the non-unit time cost of accessing the barrier value, will provide a higher savings in network traffic, but
it also adds the potential of increasing cpu idle time. We still have more state information we can use in the
barrier: the number of times the barrier flag has been polled.

Rather than continuously polling the barrier flag until it is set, we back off by some function of the number of
times we have already read the shared variable. Backoff on the barrier flag is especially useful when A > N. In
addition, it can also help prevent interference with the final processor write request that will release the processes
waiting on the flag. From Model 2 for A >> N presented earlier, the roughly τ/2 flag polls made while waiting
for the last arrival (τ being the average first-to-last arrival span of Equation (1)) can potentially be reduced to
the order of log_b(τ/2) with exponential backoff, where b is the base of the exponential backoff algorithm used.
The backoff on the barrier flag can incur a high penalty: we might back off too far and waste cpu cycles. This
idea is tested out in simulations which are discussed in the next section.
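
The tradeoff between fewer polls and overshoot can be illustrated with a minimal sketch. It assumes a simplified policy that is not spelled out in the text: a poll costs no time, there is no contention, and after the k-th unsuccessful poll the process idles for base**(k-1) cycles. The name `polls_until_set` is hypothetical.

```python
def polls_until_set(wait, base):
    """Return (polls, overshoot) for a process that starts polling `wait` cycles
    before the flag is set.

    Assumed policy: idle 1 cycle after the first failed poll, then multiply the
    idle time by `base` after every further failed poll. base=1 models
    continuous polling once per cycle.
    """
    t = 0        # current time, measured from the first poll
    polls = 0
    delay = 1
    while True:
        polls += 1
        if t >= wait:            # the flag is set by now: this poll succeeds
            return polls, t - wait
        t += delay               # idle before the next poll
        delay *= base
```

With `wait = 500`, continuous polling (`base = 1`) makes 501 polls with no overshoot, base 2 makes 10 polls and overshoots by 11 cycles, and base 8 makes 5 polls but overshoots by 85 cycles.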


Finally, we present some network access rates for barriers on multiprocessors with hardware support for
barrier synchronization to provide a basis for comparison with the backoff schemes. Examples of such hardware
support are a bus to allow global invalidations (or global updates) of cache entries, a directory with a full pointer
map, and special logic to implement a global synchronization gate [10]. If there are n processors, the invalidating
bus incurs 3n+1 accesses for a barrier: n fetches of the barrier variable, n invalidations for n writes of the barrier
variable, n fetches of the flag, and the final global invalidation caused by the write into the barrier flag, yielding
roughly 3 accesses per processor per barrier operation. The updating bus (or an invalidating scheme that can
detect a fetch with intent to write) would use n fewer accesses than the previous scheme, for roughly 2 bus accesses per
processor. Like the bus, the directory scheme must incur 3n accesses for barrier variable accesses and invalidations, and
flag accesses, but lacking a global broadcast it must incur an additional n for the individual invalidates on the
final write to the barrier flag, yielding 4 on average per processor per barrier operation. The Hoshino scheme
uses n accesses to the global synchronization gate and the final single broadcast message to the participants to
inform them to proceed, for a per-processor average of 1.
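
The per-barrier counts quoted above can be tabulated with a short sketch. The function names and dictionary keys are illustrative labels; the formulas are the ones stated in the text.

```python
def hardware_barrier_accesses(n):
    """Total accesses per barrier for n processors under each hardware scheme."""
    return {
        "invalidating bus": 3 * n + 1,      # n fetches + n invalidations + n flag fetches + 1 global invalidation
        "updating bus": 2 * n + 1,          # n fewer than the invalidating bus
        "directory": 4 * n,                 # 3n plus n individual invalidates on the final flag write
        "synchronization gate": n + 1,      # n gate accesses + 1 broadcast
    }


def per_processor(n):
    """Average accesses per processor per barrier operation."""
    return {name: total / n for name, total in hardware_barrier_accesses(n).items()}
```

For n = 64, the per-processor averages are about 3.02, 2.02, 4.0, and 1.02 respectively, matching the rounded figures of 3, 2, 4, and 1.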


5.2  Simulation  Methodology 

We also use simulations to predict barrier performance with and without backoff. The barrier and network
models are the same as described previously. Our simulation methodology is described here.

In  our  simulations  we  set  a  value  for  A  and  simulated  processors  arriving  with  uniform  probability  during 
this  interval.  Each  processor  first  increments  the  barrier  variable  and  then  spins  on  the  barrier  flag  until  it  is 
set  by  the  last  arriving  processor.  Our  previous  data  in  Table  3  showed  that  for  three  applications  the  value 
of  E  was  between  6195  and  495298  cycles  on  average  and  the  value  of  A  was  between  237  and  82787  cycles. 
Clearly a wide range is possible, so we simulated A over a wide range; for brevity we show the results for A = 0,
100, and 1000. The important factor here is the size of the interval relative to the number of processors
involved in the barrier, as our results will show. We chose values of A that span the entire spectrum.

Each  simulation  run  measured  the  average  number  of  network  accesses  made  by  a  process  from  the  time  it 
arrived  at  the  barrier  variable  to  the  time  it  proceeded  from  the  barrier  flag  after  having  successfully  tested  the 
flag  and  observing  a  true  value.  As  mentioned  before,  the  number  of  network  accesses  includes  contention  for 
the  barrier.  We  also  measured  the  average  time  each  process  spent  from  the  time  it  arrived  at  the  barrier  to 
the  time  it  left. 

The  simulation  for  each  set  of  parameters  is  repeated  100  times  and  the  numbers  are  averaged  over  all  the 
runs  to  compensate  for  the  random  variations  due  to  the  assumption  of  a  uniform  probability  of  arrival.  We 
verified that for each of the numbers we present the standard deviation was less than about ?% over the hundred
runs.


6  Evaluation 

We evaluate the backoff methods using the models just described. This section first compares the predictions of
the  model  with  simulations.  We  then  estimate  the  potential  savings  in  network  traffic  using  backoff  techniques 
and  discuss  the  tradeoffs  involved  in  choosing  the  right  parameters  for  the  backoff  algorithm. 

6.1  Estimating  the  Potential  Reduction  in  Traffic  Using  Analysis 

We  will  first  analyze  the  accuracy  of  our  simple  model  in  predicting  the  behavior  of  the  barrier  synchronization 
under  various  load  conditions.  The  model  will  indicate  the  range  of  performance  gains  that  we  might  expect 
using  the  backoff  techniques  and  give  insight  into  our  simulation  numbers. 

In  Figure  4  we  compare  the  curves  predicted  by  our  model  with  simulation  results  and  display  the  predicted 
network  accesses  for  three  cases:  A  =  0,  A  =  100,  A  =  1000.  We  will  only  compare  the  non-backoff  performance 
for  validation.  The  model  can  be  modified  to  predict  the  performance  of  the  backoff  schemes,  but  for  certain 
cases  it  can  get  quite  complicated.  We  will,  however,  mention  what  terms  in  the  model  equations  get  impacted 
by  the  various  schemes. 

The network accesses for A = 0 and A = 100 do not differ much overall, but the way in which they differ is
significant. For N < 32, A = 0 results in fewer accesses than A = 100 because when A = 0 processes do not
have to wait for the last processor to arrive at the barrier. For larger N, however, A = 100 starts performing
better because when the arrivals are spread out slightly, there is less contention in accessing the barrier. We
observe a similar behavior for A = 1000 as N approaches A. As expected, when N is small, A = 1000 makes far
more accesses than A = 0 or A = 100.

The model is accurate, as the figure shows. Model 1, as expected, matches the curves for the A << N cases.
In  particular,  Model  1  closely  approximates  the  A  =  0  case,  and  yields  a  good  match  with  the  A  =  100  curve 
for  N  >  16. 

Model 2 matches all the cases where A >> N. Specifically, the Model 2 curve for A = 1000 provides a near
perfect  match  with  the  corresponding  simulation  curve  for  all  the  values  of  N  shown.  The  Model  2  curve  for 
A  =  100  matches  the  simulation  A  =  100  curve  for  N  <  128.  When  N  is  greater  than  128,  the  model  begins  to 
underestimate  the  contention  in  accessing  the  barrier  variable.  In  general,  the  maximum  of  the  predictions  of 
the  two  models  yields  a  good  fit  with  simulation  in  all  ranges. 
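
The combined predictor described above, taking the maximum of the two models, can be sketched as follows. The function names are hypothetical; the formulas are those of Model 1 and Model 2 as defined in Section 5.1.

```python
def model1_accesses(N):
    """Model 1 (simultaneous arrival): N + N + N/2 accesses per processor."""
    return 5 * N / 2


def model2_accesses(A, N):
    """Model 2 (A >> N): tau/2 + N + N/2, with tau = A * (N - 1) / (N + 1)."""
    tau = A * (N - 1) / (N + 1)
    return tau / 2 + N + N / 2


def predicted_accesses(A, N):
    """Combined estimate: the larger of the two model predictions."""
    return max(model1_accesses(N), model2_accesses(A, N))
```

Under these formulas, Model 2 dominates whenever τ/2 exceeds N. For A = 100, for instance, the estimate is about 68 accesses at N = 16 (Model 2) and 160 at N = 64 (Model 1).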

The model implies that for the case where N > A, the potential reduction in network traffic is 20%. When
A > N, the potential gains are much more significant. If an exponential backoff method is used with base
b, then if the network accesses of the flag were M, with backoff these accesses can be reduced to the order of
log_b(M). Because the waiting processes are not busily accessing the flag, the final process that must set the
flag can usually proceed to update the flag without contending with the other processes.

Figure 4: Comparing the predictions of the analytical model and predictions of barrier performance.

6.2  Simulation  results 

We now present simulation results for barrier synchronization performance. Figure 5 shows the network accesses
for N ranging from 2 through 512 when A = 0, i.e., when all processes arrive at the barrier at the same time.
The curve follows the model as shown before, which means that the network accesses increase as 5N/2, where N is
the number of processors. The curves for backoff on the barrier variable alone, and backoff on the barrier flag
with backoff constant 2, 4, and 8 are also shown. (As mentioned before, all our simulated cases of backoff on the
barrier flag include first backing off on the barrier variable.)

Figure 5 corresponds well with our model's prediction of an average 20% reduction of synchronization
references due to backing off with information from the barrier variable, i.e., the backoff on the barrier variable
gives 2N network accesses. Not surprisingly, using binary backoff (or backoff with constants 4 or 8) on the
barrier flag made no difference because everyone reaches the barrier at the same time when A = 0. The backoff
on  the  barrier  variable  results  in  each  processor  spending  very  little  time  polling  the  barrier  flag  waiting  for  it 
to  change.  For  example,  for  the  64  processor  case,  a  processor  on  average  accessed  the  network  32  times  to  get 
at  the  barrier  variable,  96  times  to  test  the  flag  before  it  was  set,  and  32  times  after  it  was  set,  for  a  total  of 
about  160  network  accesses.  With  backoff  on  the  barrier  variable  this  number  reduced  to  roughly  132,  a  15% 
reduction. 

Backoff  with  A=1000  often  has  a  savings  greater  than  the  log  of  the  time  interval  of  arrival  at  the  barrier 
because  of  reduced  interference  with  the  final  write  request  into  the  flag.  This  phenomenon  also  explains  the 
fewer  network  accesses  for  backoff  with  base  8  at  A=1000  than  at  A=0  for  32  processors.  However,  this  savings 
often  comes  at  the  expense  of  increased  processor  waiting  times. 

Figures 6 and 7 correspond to the network accesses by a process for A = 100 and A = 1000 respectively.
In Figure 6 for the backoff on the barrier variable we see similar savings as in Figure 5 with A = 0 because
the interval A is still not very big compared to the number of processors. Note, however, the big reductions
that the exponential backoffs on the barrier flag gave. With A = 100, not everyone reaches the barrier flag
simultaneously, so the ones who arrive early back off by some exponential constant rather than continuously
polling the barrier flag. In the 16 processor case with a base 4 backoff on the barrier flag, for example, we see a
savings of over 90%. In a 64 processor case with a base 8 backoff, the savings in network accesses are about
60%.

Figure 6: Performance of backoff algorithms for A = 100.

Figure 7: Performance of backoff algorithms for A = 1000.


The proportional benefit due to backoff decreases as N increases because contention in the network to access
the barrier flag becomes a considerable portion of the network accesses. Recall that an unsuccessful network access in
accessing  the  barrier  flag  is  still  counted  as  a  network  access.  (To  reduce  these  unsuccessful  accesses  one  might 
use  backoff  techniques  in  the  network  accessing.  This  issue  is  discussed  later.)  For  example,  in  the  A  =  100  and 
N  =  512  case  with  base  8  backoff,  the  reduction  in  network  accesses  was  only  about  30%. 

For A = 1000, backoff on the barrier variable once again offers only modest savings. It is interesting to note
that  for  up  to  32  processors  this  scheme  offers  virtually  no  savings,  because  not  many  processors  are  contending 
on  the  barrier  flag.  The  savings  become  more  significant  for  larger  numbers  of  processors  because  the  backoff 
on  the  barrier  variable  reduces  the  length  of  time  that  all  of  the  processors  spend  polling  the  barrier  flag.  For 
256  processors,  for  example,  backoff  on  the  barrier  variable  yields  about  a  15%  improvement. 

The  savings  due  to  exponential  backoff  on  the  barrier  flag  with  A  =  1000,  however,  are  quite  dramatic.  Since 
the  processors  potentially  have  a  large  interval  to  poll  the  barrier  flag  before  everyone  arrives,  exponentially 
backing  off  between  testing  the  flag  helps  tremendously.  In  the  16  processor  case  with  a  binary  backoff  on 
the  flag,  for  example,  we  see  over  a  95%  savings  in  network  accesses.  The  64  processor  case  offers  a  similar 
improvement. This reduction roughly approximates a log_b reduction in the number of accesses, where b is the
base used in the exponential backoff.

The small number of network accesses with backoff on the barrier flag for the cases A = 0 and N < 8,
A = 100 and N < 32, and A = 1000 and N < 128, compares reasonably with the network accesses in the bus-based
schemes, the broadcast-based schemes, or the Hoshino scheme, with no extra hardware or the broadcast
requirement. However, when A is smaller or N is larger, the backoff schemes tend to do much worse than the
schemes that have special hardware support for synchronization.

It  is  clear  that  backoff  on  the  barrier  flag  is  potentially  much  more  beneficial  for  large  A  because  most  of  the 
network  accesses  that  happen  while  the  processes  await  the  remaining  processes  to  arrive  at  the  barrier  can  be 
obviated.  These  accesses  correspond  to  the  first  term  in  the  Model  2  equation.  Backoff  on  the  barrier  variable 
alone  does  not  impact  performance  significantly  when  N  is  small  compared  to  A,  but  can  yield  up  to  a  20% 
improvement  when  N  is  large. 

It is interesting to see that the network accesses increase dramatically for N = 128 (A = 1000). It seems that
the backoff techniques are not as useful in this case (improvement is less than about 30% for N = 256 and backoff
with constant 2), although for these cases barrier synchronization is probably inappropriate anyway without
some form of distributed software combining [25]. Our backoff methods can still be used on the intermediate
nodes of the combining tree. The reason for the sharp increase can be described as follows: When the number of
processors is small compared to A, a process can get access to the barrier flag usually within one network access.
However, when the number of processors is not small compared to A, then a process will suffer contention in
trying to access the barrier flag, and contention shows up as repeated network accesses.

Figure 8: Processor waiting times for backoff algorithms for A = 0.

In  both  cases  the  network  accesses  can  be  dramatically  reduced  for  N  <  128.  For  larger  N,  when  the 
contention  due  to  multiple  processors  simultaneously  accessing  the  barrier  increases,  the  percentage  benefits 
decrease. Note that we do nothing about these contention accesses. The next section describes a method to
reduce this problem.

Our  simulations  show  that  using  a  backoff  method  on  both  the  barrier  variable  and  the  barrier  flag  can 
yield  savings  from  20%  to  over  95%  of  the  network  accesses.  However,  the  reduction  in  network  traffic  using 
the  backoff  methods  does  not  always  come  for  free.  Because  a  backoff  method  can  cause  unnecessary  processor 
idle  time,  we  must  carefully  analyze  the  delays  that  these  techniques  can  introduce.  The  occurrence  of  delays 
alone  might  not  be  a  major  cause  for  alarm,  because  these  delays  correspond  to  the  delays  suffered  by  the 
synchronizing  processes  alone,  and  do  not  affect  other  processes.  The  next  section  addresses  these  issues. 

7  Discussion  of  Tradeoffs 

An  appropriate  backoff  constant  must  be  determined  by  trading  off  the  reduction  in  network  accesses  with  the 
potential  increase  in  the  number  of  cycles  the  cpu  spends  idling  during  backoff. 

Figures  8  through  10  correspond  to  the  average  waiting  times  for  each  of  the  processes  for  A  =  0,  100,  1000 
respectively.  The  waiting  time  for  a  process  is  computed  as  the  number  of  cycles  between  first  arriving  at  the 
barrier  to  when  the  process  finds  the  barrier  flag  set.  The  graphs  denote  the  four  cases  shown  previously,  that 
is,  without  backoff,  with  backoff  on  the  barrier  variable  and  with  exponential  backoff  on  the  barrier  flag  with 
bases  2,  4  and  8. 

We see that in all cases binary backoff provides a favorable tradeoff between large reductions in synchronization
references and contained increases in wasted cpu cycles. In the sixty-four processor case when A = 1000,
for example, the binary backoff decreased synchronization accesses by 97% while increasing the time spent at
the barrier by only 16%.

Figure 9: Processor waiting times for backoff algorithms for A = 100.

Figure 10: Processor waiting times for backoff algorithms for A = 1000.

For A = 0 and A = 100, the waiting times for all four curves are similar because the opportunity for a
large backoff time is rare given that all the processes arrive within 100 cycles of each other. The waiting time
in  these  cases  is  proportional  to  the  number  of  network  accesses,  as  it  is  precisely  these  network  accesses  that 
give  rise  to  the  delays  at  the  barrier.  This  intuition  is  corroborated  by  the  strong  resemblance  of  the  curves  in 
Figures  6  and  9. 

The  average  time  spent  idling  can  increase  dramatically  when  A  is  large  because  of  the  possibility  of  large 
backoff  times.  This  opportunity  is  greater  for  the  base  4  and  base  8  exponential  backoff  schemes.  As  an  example, 
for 64 processors and A = 1000, the waiting times without backoff and with base 8 exponential backoff on the
flag are 576 and 2048 cycles respectively, more than a 3.5-fold increase due to backoff. Even in this case, one
important  benefit  is  that  the  barrier  accesses  are  both  reduced  and  spread  out  uniformly  over  time. 

When  the  arrival  interval  A  is  much  larger  than  the  number  of  processors,  and  a  high  processor  utilization 
is  important,  one  can  modify  the  backoff  algorithm  as  follows.  If  the  backoff  amount  crosses  some  preset 
threshold,  then  it  might  be  worthwhile  to  place  the  process  on  a  queue  pending  the  arrival  of  the  last  process. 
The  enqueuing  operation  incurs  a  constant  overhead  that  might  be  unnecessary  should  the  processes  arrive 
within  a  small  interval.  Because  A  cannot  often  be  determined  a  priori,  such  a  method  of  deciding  when  to  put 
a  process  to  sleep  might  be  promising. 
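
One way to picture the threshold idea is the sketch below. It is only an illustration under stated assumptions: `flag_is_set`, `idle`, and `enqueue` are hypothetical callbacks supplied by the caller, and the default base and threshold are arbitrary; the text does not specify how the threshold should be chosen.

```python
def wait_at_barrier(flag_is_set, idle, enqueue, base=2, threshold=64):
    """Poll the barrier flag with exponential backoff.

    Once the next backoff amount would exceed `threshold` cycles, the process
    is handed to `enqueue` (e.g. put to sleep on a queue) instead of spinning.
    Returns "spun" if the flag was seen set while polling, else "queued".
    """
    delay = 1
    while not flag_is_set():
        if delay > threshold:
            enqueue()            # pay the constant enqueue overhead once
            return "queued"
        idle(delay)              # back off for `delay` cycles
        delay *= base
    return "spun"
```

With a simulated clock, a process whose flag is set after 10 cycles never reaches the threshold and returns "spun", while one that would have to wait 1000 cycles is queued after its backoff grows past 64 cycles.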

Interestingly,  for  the  A  =  1000  case,  the  average  waiting  times  per  processor  reach  a  maximum  around  64 
processors  and  then  actually  decline  as  N  increases.  When  the  number  of  processors  is  small  compared  to  A, 
the  processors  can  test  the  flag  without  excessive  contention  with  other  processes.  After  each  unsuccessful  test, 
they back off, and the backoff time is exponentially related to the number of times they unsuccessfully test the flag.
Because  the  number  of  such  accesses  can  be  quite  large  when  contention  is  low  and  A  is  large,  there  arises  the 
potential  for  overshooting  the  point  where  the  flag  is  set  by  a  large  amount.  Conversely,  when  the  number  of 
processors  is  comparable  to  A  (or  greater  than  A),  the  number  of  times  a  process  manages  to  access  the  barrier 
flag  is  small  due  to  contention  with  other  processes.  In  such  cases,  the  network  access  count  increases,  but  the 
average  waiting  time  per  processor  decreases.  Referring  to  Figures  7  and  10  the  decrease  in  the  waiting  time 
for  the  backoff  curves  closely  corresponds  to  the  increase  in  network  accesses. 

7.1  Summary 

A few general observations can be made at this point. When the number of processors participating in the
barrier synchronization is small compared to the arrival interval of the processors, a small backoff base yields
a significant reduction in network accesses without compromising processor utilization through backoff waiting.
In such cases, the number of synchronization network accesses is similar to those made in schemes
that  use  special  hardware  support  such  as  synchronization  buses,  broadcasts,  or  global  synchronization  logic. 
When  the  number  of  processors  is  large,  and  if  they  arrive  within  a  relatively  small  interval  of  time,  a  penalty 
in  either  network  accesses  or  processor  idle  time  must  be  paid.  However,  depending  on  the  situation,  one  can 
be  traded  for  the  other. 

Our discussion thus far has focused on the traffic and the waiting time during the execution of the barrier. We
can  also  look  at  the  effect  on  average  traffic  with  the  caveat  that  such  smoothing  might  tend  to  make  barrier 
accesses  seem  less  disruptive.  We  measured  the  average  network  data  traffic  per  processor  in  FFT  (assuming 
separate  packet-switched  networks  for  the  request  and  response),  excluding  synchronization  references,  to  be 
0.133  network  accesses  per  cycle.  Using  results  from  our  simulations  of  the  barriers  with  A  =  100  (roughly 
approximating  the  barrier  interval  A  in  FFT  with  64  processors)  we  compute  the  extra  traffic  due  to  barriers 
when  the  barrier  variable  and  the  barrier  flag  are  not  cached.  Adding  these  synchronization  references  to  our 
base  network  traffic,  the  average  traffic  increases  to  0.136  network  accesses  per  cycle  (assuming  that  the  base 
traffic  in  A  is  also  0.133).  Now,  with  a  base  8  exponential  backoff  we  find  that  the  average  network  traffic  drops 
to  0.134.  This  decrease  is  significant  considering  that  these  savings  come  from  reductions  in  synchronization 
references  which  are  effectively  hot-spot  references.  Moreover,  we  observe  in  this  case  that  the  base  8  exponential 
backoff  also  results  in  a  10  percent  decrease  in  waiting  time  at  the  barrier.  Both  average  network  traffic  and 
waiting  time  at  synchronizations  are  reduced  using  backoff  methods  for  our  FFT  application. 
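
As a rough check on these figures, the synchronization component of the traffic can be isolated by subtracting the base traffic. The symbols below are introduced only for this illustration, and the result is limited by the three-digit rounding of the reported numbers.

```latex
\begin{align*}
T_{\text{sync}}^{\text{spin}}   &= 0.136 - 0.133 = 0.003 \\
T_{\text{sync}}^{\text{base 8}} &= 0.134 - 0.133 = 0.001 \\
\frac{T_{\text{sync}}^{\text{base 8}}}{T_{\text{sync}}^{\text{spin}}} &\approx \frac{1}{3}
\end{align*}
```

On these rounded figures, backoff removes roughly two thirds of the synchronization traffic, even though total traffic changes by less than 2 percent.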

As a validation of our barrier simulation model, we also compared the average network traffic in FFT
when synchronization references are not cached with the average network traffic predicted by our barrier model
simulations. The numbers correlated well, with barrier simulations predicting 0.136 net accesses per cycle per
processor, while measurements from FFT yielded 0.135.

We analyzed the tradeoff between network accesses and processor idle time due to backoff. In general,
reducing the number of network accesses might be more important than reducing the processor idle time,
because fewer network accesses also reduce the processor idle time: contention in the network drops,
and so does competition with the regular network activities of the other
processors not involved in the barrier.


8  Optimizations  and  Extensions 

This  paper  focused  on  the  effect  of  adaptive  backoff  techniques  on  barrier  synchronization.  The  same  methods 
can  be  applied  in  several  other  cases.  For  example,  this  technique  can  be  applied  to  processors  waiting  on  a 
resource. Processors waiting to access a resource can back off from testing the resource by an amount proportional
to  the  number  of  processors  waiting.  Adaptive  techniques  will  likely  perform  much  better  in  this  situation 
than  with  barrier  synchronizations  because  the  amount  of  time  a  processor  has  to  wait  at  a  resource  is  directly 
proportional  to  the  number  of  processors  waiting  (with  the  constant  of  the  proportion  being  the  average  amount 
of  time  the  resource  is  held  by  each  processor).  In  a  barrier  situation,  the  amount  of  time  a  processor  has  to 
wait  at  the  barrier  flag  is  not  necessarily  directly  proportional  to  the  number  of  processors  which  have  reached 
the  barrier. 
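
A minimal sketch of the resource case follows. It assumes that each test of the resource also reveals how many processors are still ahead of the waiter, and that each of them holds the resource for a fixed number of cycles. All names and parameters are hypothetical and serve only to illustrate the proportionality argument.

```python
def proportional_backoff_delay(waiters_ahead, avg_hold_cycles):
    """Cycles to wait before the next test: waiters ahead times the
    estimated average hold time (never less than one cycle)."""
    return max(1, waiters_ahead * avg_hold_cycles)


def polls_until_acquired(waiters_ahead, hold_cycles, avg_hold_cycles):
    """Count the tests one waiter issues before it gets the resource.

    Assumption: every processor ahead holds the resource for exactly
    `hold_cycles`; `avg_hold_cycles` is the waiter's estimate of it.
    Returns (polls, acquire_time).
    """
    now = 0
    polls = 0
    while True:
        polls += 1
        # Processors ahead that have not finished by cycle `now`.
        remaining = waiters_ahead - now // hold_cycles
        if remaining <= 0:
            return polls, now
        now += proportional_backoff_delay(remaining, avg_hold_cycles)


if __name__ == "__main__":
    print(polls_until_acquired(10, 50, 50))  # exact estimate of hold time
    print(polls_until_acquired(10, 50, 40))  # estimate that is too low
```

With an exact estimate the waiter tests the resource only twice. With an estimate that is too low it tests a few more times, but still far fewer times than a waiter that spins continuously for the full 500 cycles.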

Another  similar  method  that  can  reduce  contention  in  unbuffered  circuit-switched  networks  is  to  use  adaptive 
backoff  methods  for  network  accesses  also.  If  a  network  access  suffers  a  collision,  instead  of  resubmitting  the 
request immediately, one can back off by some amount first. This backoff amount can be determined in one of
several  ways: 

(1)  For  example,  a  network  supplied  status  byte  can  be  used  to  determine  the  stage  at  which  the  collision 
occurred.  The  backoff  amount  can  be  proportional  to  the  network  depth  traversed  by  the  message.  The 
rationale  for  this  choice  is  that  the  deeper  a  message  travels,  the  greater  the  network  resource  that  it  ties  up 
in  its  unsuccessful  attempt.  Conversely,  if  a  collision  occurs  within  a  few  stages  of  travel  into  the  network,  the 
access  can  be  resubmitted  sooner  as  the  network  resources  tied  up  will  be  smaller. 

(2)  An  argument  for  making  the  backoff  amount  inversely  proportional  to  the  network  depth  traversed  can 
also  be  made.  The  deeper  a  message  travels  before  colliding,  the  less  congested  the  network  is  expected  to  be, 
and  so  the  access  can  be  retried  sooner.  Simulations  can  be  used  to  study  the  tradeoffs  involved  in  these  two 
opposing  arguments  and  suggest  a  practical  backoff  algorithm. 

(3)  On  a  collision,  a  network  access  might  wait  some  constant  time  proportional  to  the  average  round  trip 
time  to  memory  through  the  network  before  resubmitting  the  request. 

(4)  The  number  of  previous  unsuccessful  tries  can  be  used  as  a  parameter  to  an  exponential  backoff  algorithm. 

(5)  In  a  packet-switched  network,  Scott  and  Sohi  [20]  make  use  of  the  state  information  found  in  the  queues  at 
the  memory  modules  to  signal  processors  to  stop  making  requests  in  congested  situations.  This  state  information 
could  also  be  used  to  have  the  processors  back  off  sending  requests  by  some  time  proportional  to  the  length  of 
the  queue. 
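
The five alternatives above can be written as small delay functions, as in the following sketch. The function names, the default constants, and the integer arithmetic are assumptions made for illustration. In particular, the status byte of (1) and (2) and the queue length of (5) are modeled as plain integer arguments rather than as any real network interface.

```python
def delay_proportional_to_depth(stage, unit=1):
    """Policy (1): a deeper collision tied up more of the network,
    so the retry waits longer."""
    return unit * stage


def delay_inverse_to_depth(stage, scale=16):
    """Policy (2): a deeper collision suggests a less congested
    network, so the retry waits less (at least one cycle)."""
    return max(1, scale // max(1, stage))


def delay_round_trip(round_trip, factor=1):
    """Policy (3): constant wait proportional to the average
    round-trip time to memory."""
    return factor * round_trip


def delay_exponential(failures, base=2, cap=1024):
    """Policy (4): wait grows exponentially with the number of
    previous unsuccessful tries, up to a hypothetical cap."""
    return min(cap, base ** failures)


def delay_queue_length(queue_length, unit=1):
    """Policy (5): wait proportional to the queue length reported
    by the memory module."""
    return unit * queue_length


if __name__ == "__main__":
    # Collision at stage 4 of the network, third failed attempt.
    print(delay_proportional_to_depth(4), delay_inverse_to_depth(4),
          delay_exponential(3))
```

Policies (1) and (2) take the same input and pull in opposite directions, which is why the text suggests simulation to choose between them.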

As  we  mentioned  before,  the  adaptive  backoff  techniques  that  we  evaluated  do  not  require  special  hardware 
support.  The  synchronization  software  that  determines  which  backoff  method  is  used  can  be  designed  in  one 
of  several  ways.  One  can  be  conservative  and  use  a  simple  adaptive  backoff  on  the  barrier  variable  and  a 
binary  backoff  on  the  barrier  flag.  The  programmer  can  write  the  algorithms  into  the  synchronization  macros 
or  routines  from  a  knowledge  of  the  application.  The  compiler  can  determine  appropriate  code  sequences  for 
the  barrier  synchronizations  based  on  expected  behavior  of  loops  and  the  amount  of  visible  parallelism.  One 
can  get  more  venturesome  by  using  profiling  to  determine  the  temporal  behavior  of  the  application  and  the 
number  of  processors  participating  in  the  synchronization  and  pass  this  information  on  to  the  compiler  for 
further  optimization.  One  case  where  such  information  might  be  useful  is  in  determining  when  to  (or  whether 
to) queue a process to await a signal when the barrier flag is set rather than spinning on the network.

9 Conclusions


Network  bandwidth  is  a  precious  resource  in  large-scale  shared  memory  multiprocessors.  In  this  paper  we 
present a group of adaptive synchronization techniques aimed at reducing the number of network accesses
due to synchronizations. We model adaptive techniques for barrier synchronizations and show that in some
cases these techniques can achieve dramatic savings at minimal extra cost, while in other situations network
accesses can be reduced while trading off processor utilization of synchronizing processors. These techniques
are  implemented  in  software,  and  they  can  be  optimized  for  varying  applications. 

The  central  idea  behind  an  adaptive  synchronization  technique  is  to  make  use  of  information  available  from 
synchronization  state  and  from  past  history  to  reduce  the  number  of  idle  synchronization  spins.  The  general 
technique  is  useful  for  barrier  synchronizations  as  well  as  other  situations  such  as  reducing  accesses  made  by 
processors  waiting  on  a  resource  or  reducing  contention  in  unbuffered  circuit-switched  networks. 

A Tracing Methodology

The multiprocessor traces we used for our simulations were generated using a “post-mortem scheduling” technique
in which a multiprocessor trace is created from a memory reference trace of a uniprocessor execution of
a  parallel  application.  Key  to  the  scheme  is  that  the  uniprocessor  execution  trace  include  information  about 
synchronization  events  in  the  code.  Using  this  record  of  synchronization  events,  a  scheduler  can  schedule  tasks 
from  the  uniprocessor  execution  trace  into  a  multiprocessor  trace  in  which  the  synchronization  sections  are 
simulated. 

This  methodology  can  be  used  for  a  variety  of  programming  paradigms.  The  two  applications  we  traced  are 
both  written  in  Epex/Fortran  using  the  Single-Program-Multiple-Data  (SPMD)  computational  model  [6].  In 
this  model  all  processes  are  created  at  the  beginning  of  the  program  and  execute  the  same  program.  Though 
all  processes  are  executing  the  same  program,  synchronization  constructs  embedded  in  the  code  dynamically 
determine  which  sections  of  the  program  processors  execute.  The  SPMD  model  for  Epex/Fortran  contains 
serial  and  parallel  sections  along  with  replicate  sections,  which  are  executed  by  all  processors.  We  use  this 
model  in  the  FFT,  SIMPLE  and  WEATHER  applications  because  it  is  a  good  method  by  which  to  exploit 
the  parallelism  in  these  scientific  applications  without  making  major  changes  (likely  modifying  the  fundamental 
algorithms  used)  to  the  already  existing  uniprocessor  code. 

The post-mortem scheduler simulates synchronization events in the application using some prescribed synchronization
implementation. We simulate fetch-and-adds (F&A), a synchronization primitive used to exclusively
update a location in memory, with an atomic read-modify-write operation. In EPEX/FORTRAN,
synchronization constructs at the beginning of parallel and serial sections perform F&As on shared variables to
determine  task  assignments  to  processes.  Barriers  and  waits  at  the  end  of  loops  and  serial  sections  are  simulated 
by  arriving  processors  first  incrementing  a  shared  variable  through  a  F&A  and  then  polling  a  barrier  flag  until 
it  is  set  by  the  last  arriving  processor. 
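
The barrier just described can be sketched as follows. This is a hedged illustration and not the scheduler's code: the fetch-and-add is emulated with a lock, the barrier is one-shot (it cannot be reused), and the class names, the `base` and `unit` parameters, and the cap on the exponent are all assumptions.

```python
import threading
import time


class FetchAndAddCell:
    """Shared counter with an atomic fetch-and-add, emulated by a lock."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old


class SpinBarrier:
    """One-shot barrier: F&A on a barrier variable, then poll a flag."""

    def __init__(self, n):
        self.n = n
        self.count = FetchAndAddCell(0)  # the barrier variable
        self.flag = False                # the barrier flag

    def wait(self, base=2, unit=1e-5):
        """Returns the number of unsuccessful flag tests by this caller."""
        arrived = self.count.fetch_and_add(1) + 1
        if arrived == self.n:
            self.flag = True             # last arrival releases the others
            return 0
        failures = 0
        while not self.flag:             # poll the flag
            failures += 1
            # Exponential backoff between tests; exponent capped at 10.
            time.sleep(unit * base ** min(failures, 10))
        return failures


if __name__ == "__main__":
    barrier = SpinBarrier(4)
    workers = [threading.Thread(target=barrier.wait) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("all released:", barrier.flag)
```

The value returned by `wait` stands in for the number of network accesses a waiting processor makes to the flag, which is the quantity the backoff schemes try to reduce.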

The  uniprocessor  memory  reference  trace  with  synchronization  information  was  produced  by  PSIMUL  [21], 
a  multiprocessor  simulator.  PSIMUL  generates  IBM  S/370  memory  reference  traces  and  has  the  capability  of 
marking  down  into  the  trace  the  type  of  synchronization  constructs  it  traverses  while  tracing  the  application. 
Our  scheduler  simulates  a  parallel  execution  of  this  trace,  assigning  processors  references  from  the  trace  on  a 
round-robin  basis.  We  assume  that  processors  make  a  memory  reference  every  cycle,  which  is  an  approximation 
because  the  S/370  instruction  set  contains  register-to-register  instructions. 

The Fast Fourier Transform (FFT) [4] application, written at IBM, is a parallelized version of a Radix-2
FFT computation in two variables on a random array of complex numbers. Since we used a problem size of
128, the parallel loops working on the 128x128 matrix contained 128-way parallelism. This provided for an
even distribution of work among processors for the 64 processor simulations. We traced two passes of the TF2
routine, which computes the FFT, through the matrix, first by rows and then by columns. FFT is an example
of a highly uniform parallel application in which processors execute parallel loop iterations of approximately
equal length and arrive at barriers within close intervals.

The SIMPLE code models the hydrodynamic and thermal behavior of fluids in two dimensions [5]. Finite
difference methods are used to solve the equations of inviscid compressible hydrodynamics and simple heat
conduction. The problem is formulated on an NxN mesh. Once again, we used a problem size of 128, but
many of the parallel sections in SIMPLE do not contain fully 128-way parallelism. The resulting distribution
of work among the 64 processors in our simulations is uneven. Sixty-four processors is, however, the optimal
number of processors to execute this application, given the problem size. Another important difference between this
application and FFT is that SIMPLE contains a number of small and large parallel loops (20 in all) rather
than the few large parallel loops that FFT contains. SIMPLE also contains many small serial sections (5) in
which one processor executes the serial section while all the rest wait at the bottom. The resulting difference is
that SIMPLE contains far more synchronization activity than FFT. Parallel loop iteration lengths in SIMPLE
vary occasionally, also contributing to more synchronization accesses because more processors wait at the end
of parallel loops with uneven loop iterations. SIMPLE is representative of a typical application that
exhibits neither worst-case nor best-case performance given our SPMD computational model.

The  WEATHER  code  forecasts  the  weather  by  modeling  the  state  of  the  atmosphere  as  described  by  the 
NASA GLAS/GISS fourth order general circulation model of a three-dimensional atmosphere [11]. The algorithm
breaks the atmosphere down into a three-dimensional grid encircling the globe and computes the value
of  several  interrelated  state  variables  using  finite  difference  methods.  In  the  model  simulated  by  WEATHER, 
the  atmosphere  was  represented  by  nine  regions  of  fixed  altitude  and  a  grid  uniformly  spread  across  longitude 
and  latitude  on  each  layer.  In  the  runs  we  traced,  the  grid  was  108  by  72.  Parallel  sections  of  the  COMP1 
routine,  which  calculates  horizontal  and  vertical  advection  differences  in  the  atmosphere,  were  traced.  The 
load-balancing in this application is far worse than in FFT and SIMPLE, given that it was simulated with 64
processors. Since the parallelism is derived by simultaneously working on rows/columns of the atmosphere grid,
and the dimensions of the grid are not multiples of 64, many processors are forced to idle in parallel sections
which  are  followed  by  barriers.  Fifty-four  processors  would  be  a  more  optimal  number  of  processors  to  execute 
this  application  with  the  problem  size  used.  Thus  the  load  balancing  in  our  three  applications  showed  a  wide 
range. 
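
The load-balancing remarks above can be checked with a short calculation. The sketch assumes that every row or column is one loop iteration of equal length and that iterations are handed out in rounds of one per processor, with a barrier after the loop. The function name and this scheduling model are illustrative assumptions.

```python
import math


def parallel_efficiency(iterations, processors):
    """Fraction of processor time spent on useful work in one parallel
    loop, assuming equal-length iterations executed in rounds."""
    rounds = math.ceil(iterations / processors)
    return iterations / (rounds * processors)


if __name__ == "__main__":
    print("FFT 128 on 64:    ", parallel_efficiency(128, 64))
    print("WEATHER 108 on 64:", parallel_efficiency(108, 64))
    print("WEATHER 72 on 64: ", parallel_efficiency(72, 64))
    print("WEATHER 108 on 54:", parallel_efficiency(108, 54))
    print("WEATHER 72 on 54: ", parallel_efficiency(72, 54))
```

Under this model the 128-way loops of FFT keep 64 processors fully busy. The 108 and 72 dimensions of the WEATHER grid give about 0.84 and 0.56 on 64 processors, and 1.0 and about 0.67 on 54 processors, which is consistent with the observation that 54 processors suit this problem size better.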